Arabic Dialect Identification Using a Parallel Multidialectal Corpus
نویسندگان
چکیده
We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization – a method not previously applied for this task. We first conduct a 6-way multi-dialect classification task in the first experiment, achieving 74% accuracy against a random baseline of 16.7% and demonstrating that meta-classifiers can large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as 94%, but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (76%). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26k sentences from the radically different AOC dataset with 74% accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with 97% accuracy. We find that character n-grams are a very informative feature for this task, in both withinand cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined. Keywords-Arabic Dialects; Automatic Dialect Identification; Parallel Corpus; Text Classification;
منابع مشابه
A Multidialectal Parallel Corpus of Arabic
The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic speaking communities. These dialects vary widely from region to region and to a lesser extent from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. Howev...
متن کاملColing • Acl 2006 Tag + 8
This paper discusses a novel probabilistic synchronous TAG formalism, synchronous Tree Substitution Grammar with sister adjunction (TSG+SA). We use it to parse a language for which there is no training data, by leveraging off a second, related language for which there is abundant training data. The grammar for the resource-rich side is automatically extracted from a treebank; the grammar on the...
متن کاملDomain and Dialect Adaptation for Machine Translation into Egyptian Arabic
In this paper, we present a statistical machine translation system for English to Dialectal Arabic (DA), using Modern Standard Arabic (MSA) as a pivot. We create a core system to translate from English to MSA using a large bilingual parallel corpus. Then, we design two separate pathways for translation from MSA into DA: a two-step domain and dialect adaptation system and a one-step simultaneous...
متن کاملMultidialectal Spanish acoustic modeling for speech recognition
During the last years, language resources for speech recognition have been collected for many languages and specifically, for global languages. One of the characteristics of global languages is their wide geographical dispersion, and consequently, their wide phonetic, lexical, and semantic dialectal variability. Even if the collected data is huge, it is difficult to represent dialectal variants...
متن کاملMonodialectal and multidialectal infants' representation of familiar words.
Monolingual infants are typically studied as a homogenous group and compared to bilingual infants. This study looks further into two subgroups of monolingual infants, monodialectal and multidialectal, to identify the effects of dialect-related variation on the phonological representation of words. Using an Intermodal Preferential Looking task, the detection of mispronunciations in familiar word...
متن کامل